Machine Learning Intro

Spring R ’24

Dominic Bordelon, Research Data Librarian, ULS

Agenda

  1. What is machine learning?
  2. Supervised learning
    • Regression, \(K\)-nearest neighbors, decision trees…
  3. Model assessment
  4. Unsupervised learning
    • Principal component analysis, \(K\)-means clustering
  5. Reinforcement learning: learning with rewards

About the trainer

Dominic Bordelon, Research Data Librarian
University Library System, University of Pittsburgh
dbordelon@pitt.edu

Services for the Pitt community:

  • Consultations
  • Training (on-request and via public workshops)
  • Talks (on-request and publicly)
  • Research collaboration

Support areas and interests:

  • Computer programming fundamentals, esp. for data processing and analysis
  • Open Science and Data Sharing
  • Data stewardship/curation
  • Research methods; science and technology studies

Today’s packages

  • {rsample}, part of {tidymodels}, for train/test splits
  • {recipes}, part of {tidymodels}, for preprocessing
  • {naivebayes}
  • {rpart} and {rpart.plot} for decision trees
  • {kknn} for K Nearest Neighbors
  • {tidyclust}, part of {tidymodels}, for K-Means clustering
#install.packages(c("tidymodels", "naivebayes"))

💡 In terms of writing code, there are a variety of approaches to modeling in R, even for fitting the same type of model (e.g., when implemented by different package developers). We will favor the tidymodels approach.

…and using penguins examples

library(tidyverse)
library(palmerpenguins)
data(penguins)
names(penguins)

What is machine learning (ML)? and supervised vs. unsupervised?

What is (supervised) machine learning?

  • Using the computer to learn from data (yes, it is that simple)
  • Applies statistics (finding information in data) and computer science (algorithmic design and implementation)
  • ML is not quite the same as AI (but AI is built on ML)
  • Supervised ML has a has a predictive output or target (regression or classification of a variable). A model is fit which predicts (or retrodicts) some \(y\) from one or more \(x\).
  • Unsupervised ML (next week) does not have a “target;” these methods tend to look for patterns in the data (e.g., clustering)

ML (and statistics) terminology

Often describes statistical concepts with different language, due to separate disciplinary traditions.

Some terminology encountered in ML and statistics. Source: Adapted from Zachary Kurtz, “Translating Between Statistics and Machine Learning” (2018)
Statistics term ML / Computer Science term
observation, case example, instance
response variable, dependent variable label, output
predictor, independent variable feature, input
regression regression, supervised learner, machine
estimation learning

Terminology

⚠ Terms/concepts to be careful with in ML, coming from stats:

  • hypothesis (sometimes an output of a classifier model)

  • bias (broader meaning)

  • causality (sometimes less rigorous than stats)

Supervised methods

Some we have already seen

  • Simple linear regression
  • Logistic regression
  • Multiple linear regression

Regression and classification

Regression

  • Understand a numeric / continuous variable’s relationship to one or more predictors
  • or, Predict some numeric / continuous value from observations
  • How tall (\(y\)) will my plant be if I give it \(x\) amount of water?”

Classification

  • Understand a categorical variable’s relationship to one or more predictors
  • or, Predict some numeric / continuous value from observations
  • What kind of seeds (\(y\)) did I plant if the current height is \(x\)?”

Model testing / validation

  • ML gives us many ways to do the same tasks (regress, classify).
  • With supervised methods, we can furthermore assess each model’s performance by comparing the prediction to observed output values. These comparisons are expressed with metrics such as \(R^2\).
  • However, to be confident that we are not overfitting, we should split the data in some proportion (e.g., 80/20) prior to fitting. Then we can assess not only how the model fits the “training” (fitting) data, but also how it fits previously-unseen data, called “test” data
  • Let’s split the penguins 80/20
  • There are much more sophisticated ways of validation, such as fitting many models on subsets of the data and averaging their results
library(tidymodels) 

# setting seed ensures that "random" behavior is reproducible, i.e. the same every time
set.seed(1234)
penguins_clean <- penguins %>% 
  drop_na()
pens_split <- initial_split(penguins_clean, prop = 0.8)
pens_train <- training(pens_split)
pens_test <- testing(pens_split)

Naive Bayes classification

Naive Bayes classifier

  • We assume that each predictor variable has a normal (Gaussian) distribution
    • For each variable, the distribution is estimated based on observed data
    • To classify an observation, plot it and check its probability of membership for each class.

Animation of the naive Bayes classifier. Color intensity indicates probability of group membership. Image source: Jacopo Bertolotti via Wikimedia Commons (CC0)

Naive Bayes examples

Naive Bayes code example

library(naivebayes)

# predict sex from species and body mass:

sex_fit <- naive_bayes(sex ~ species + body_mass_g, data = pens_train)

library(klaR)
plot(sex_fit)
predict(sex_fit, data.frame(body_mass_g=3500, species="Gentoo"))

Decision trees

Decision trees

  • We stratify the predictor space using splitting rules
    • Recursive binary splitting:
      select the \(X\) most correlated with \(Y\) \(\rightarrow\)
      find a good “cut point” (decision boundary) to split the data, according to \(Y\) \(\rightarrow\)
      in the two resulting bins, the process is repeated, and so on, until a stop condition is reached.
  • May be used for classification but also regression
  • Very explainable, but “typically are not competitive with the best supervised learning methods” in terms of accuracy (James et al. 2021); also not very robust on their own
    • 💡 Methods like random forests, bagging, and boosting extend the decision tree algorithm and address issues of accuracy and robustness—popular

Animation of a simple decision tree example. Each binary branch in the tree on the left corresponds to a partitioning in the x-y space. The response variable (output) of this model is gray/green color classification. Image source: Algobeans

Image source: James et al. 2021

Decision tree code example

library(rpart)
library(rpart.plot)

pens_tree <- rpart(sex ~ species + body_mass_g, data = pens_train)x
rpart.plot(pens_tree)

K Nearest Neighbors

K Nearest Neighbors

  • A classifier
  • “For my observation of interest, which class do the K nearest neighbors belong to?”
    • K = chosen by analyst
    • “Nearest neighbor” is defined using distance in the predictor space

K Nearest Neighbors

Effects of changing K.

K Nearest Neighbors code example

library(kknn)

# centering and scaling data:
pens_recipe <- recipe(sex ~ species + body_mass_g, data = pens_train) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_center(-sex) %>% 
  step_scale(-sex) %>% 
  prep()

pens_train_juiced <- juice(pens_recipe)

# visualize processed data:
pens_train_juiced %>% pivot_longer(-sex) %>% 
    ggplot() +
    geom_histogram(aes(value, fill = sex)) +
    facet_wrap(~name)

baked_test <- bake(pens_recipe, new_data = pens_test)

Fitting

knn_spec <- nearest_neighbor() %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

knn_fit <- knn_spec %>% 
  fit(sex ~ ., 
      data = pens_train_juiced)

knn_fit

Metrics:

knn_fit %>% 
    predict(baked_test) %>% 
    bind_cols(baked_test) %>% 
    metrics(truth = sex, estimate = .pred_class)

Unsupervised methods

Unsupervised learning

Unsupervised learning has no predictive model: instead it finds previously unknown structure in the data. All variables or features of the data are considered together.

Unsupervised learning tends to be most useful for exploratory data analysis, i.e., prior to having a goal for regression or classification.

  • Principal Components Analysis (PCA): dimensionality reduction
  • \(K\)-Means Clustering

Dimensionality reduction

Principal components analysis (PCA)

  • \(X_1, X_2,…,X_p\) features are reduced to a small number of “principal components”
    • \(Z_1\), the first principal component, accounts for most of the variation in the data
    • \(Z_2\) accounts for most of the remaining variation
  • Example of dimensionality reduction 🤏
  • Useful for:
    • understanding
    • deriving variables for supervised methods
    • visualizing 3+ variables in a 2D space
    • imputation (guessing empty values)
  • Similar to Linear Discriminant Analysis (LDA), a supervised method which incorporates dimensionality reduction among predictors

Animation demonstrating projection of two features onto a single histogram using principal components analysis. Image source: Amélia O. F. da S. via Wikimedia Commons (CC BY-SA 4.0)

Image source: James et al. 2021

PCA examples

PCA code example

PCA is a recipe step.

Read more here for important details: https://recipes.tidymodels.org/reference/step_pca.html

pca_recipe <- pens_recipe %>% 
  step_pca(all_numeric_predictors(), num_comp = 3) %>% 
  prep()

pca_juiced <- juice(pens_recipe)
pca_juiced

Clustering

K-Means Clustering

  • A number \(K\) of previously unknown clusters are identified; or: we partition the data into \(K\) clusters
    • \(K\) is chosen by the analyst
    • Clusters aren’t completely random, but based on similarities in the data
    • Assign every observation a number 1 through \(K\) \(\rightarrow\)
      calculate each cluster centroid \(\rightarrow\)
      (re)assign each observation to the closest centroid \(\rightarrow\)
      continue calculating and reassigning until movement stops
    • Total within-cluster variation is minimized
  • Not a predictive classifier!
  • Useful for:
    • Exploration
    • Identifying potential subpopulations

Animation of the K-means algorithm in action. After initial random group assignment, centroids are randomly placed and used to classify. Then centroids and assignment are iteratively adjusted until movement stops. Image source: Chire on Wikimedia Commons (CC BY-SA 4.0)

150 observations in 2D space, clustered according to different values of K. Prior to clustering, data are not categorized. Colors indicate which group each observation is assigned to by the model. Image source: James et al. 2021

K-Means examples

  • Identifying patients at risk for paternal age-related schizophrenia (Lee et al. 2011)

K-Means code example

library(tidyclust)
kmeans_spec <- k_means(num_clusters = 4)
kmeans_spec

kmeans_fit <- kmeans_spec %>% 
  fit(~ body_mass_g + bill_length_mm, data = pens_train)

kmeans_fit %>% extract_fit_summary()
kmeans_fit %>% extract_cluster_assignment()

Hierarchical clustering

  • Hierarchical clustering splits the dataset into nested groups, until all observations are in a cluster of 1
    • This splitting approach is divisive hierarchical clustering
  • The result is a tree (dendogram), and we can assess where we want to “cut” and how many categories will be created.
  • “Distance” (defined various ways) is measured between clusters, in order to make the splitting/merging decisions.
  • Agglomerative hierarchical clustering starts with all points as clusters of size 1, and merges them until reaching the “root” of the tree

Hierarchical clustering example. Cutting vertically, at different points along the x axis, will create different numbers of clusters. Image source: Wikimedia Commons

Hierarchical clustering code example

hc_spec <- hier_clust(
  num_clusters = 3,
  linkage_method = "average"
)
hc_spec

hc_fit <- hc_spec %>%
  fit(~ body_mass_g + bill_length_mm,
    data = penguins
  )

hc_fit %>%
  summary()

hc_fit$fit %>% plot()

Using model results

  • Developing familiarity with the data
  • For “labeling”—applying classes to observations
    • In a subsequent step, if we treat these classes as ground truth, we could also create a classifier! (supervised method)
  • For data cleaning (anomaly detection)
  • For investigation of interesting cases (anomaly detection)
  • For visualization (dimensionality reduction)

How to choose a ML method?

  • What do you want to do? (regress, classify, reduce number of variables, cluster, find anomalies)
  • What special properties do your data have? (Very skewed? Many missing values? Many covariates? Etc.)
  • What approaches have appeared in literature?
  • What would make for an interesting data exploration or question? (What ideas can I steal from [science journalism of] other fields?)

or, looking for discipline-specific R?

Check out the Big Book of R! An online directory at https://www.bigbookofr.com/ of very many R ebooks, most of them free OER and produced by experts, organized by discipline/topic and searchable.

Look up your discipline (or some topic that interests you, e.g., time series data) and see what applications of R you can find.

Example graphic of a recent update

Concluding thoughts

  • Every approach makes certain assumptions and tradeoffs
    • Bias–variance tradeoff: two sources of error; having a model that is not overfit to the data nor too general
    • Consider accuracy vs. explainability (“performance vs. complexity”)
  • Experiment, but don’t jump to conclusions. Consult your discipline’s literature. Look for alternative ways to explore and validate unexpected findings.

A common conceptualization in the (applied) ML community. Image source: Herm et al. 2023